Convergence Behavior of Temporal Difference Learning


Raj  P.  Malhotra 

The  University  of  Dayton,  Electrical  Engineering  Department 


Abstract: Temporal Difference Learning is an important class of
incremental learning procedures which learn to predict outcomes of
sequential processes through experience. Although these algorithms
have been used in a variety of well-known intelligent systems such as
Samuel's checker-player and Tesauro's Backgammon program, their
convergence properties remain poorly understood. This paper provides
a brief summary of the theoretical basis for these algorithms and
documents observed convergence performance in a variety of
experiments. The implications of these results are also briefly
discussed.

INTRODUCTION 

Problems  involving  sequential  processes  are 
encountered  in  a  wide  variety  of  interesting 
engineering  situations.  In  heuristic  search  applications 
one  attempts  to  learn  an  evaluation  function  which 
predicts  the  utility  of  searching  in  a  certain  region  of 
the  search  space.  Game-like  decision  processes  involve 
learning  to  predict  outcomes  based  upon  current  state. 
Difficult  modeling  tasks  and  pattern  recognition 
problems  often  involve  learning  process  complexities 
through observing sequential behavior. The common
thread among these problems is learning to predict process
outcomes and sequential behavior. Often, the goal of
prediction  is  not  an  end  in  itself,  but  rather  a 
prerequisite  for  effective  control  of  some  complex, 
sequential  system. 

For problems in which a priori knowledge is limited,
unreliable, or unavailable, Reinforcement Learning
methods ([Klopf],[Sutton],[Watkins]) have received
increasing attention as viable control mechanisms
which can achieve near-optimal performance.
Mathematically,  these  algorithmic  methods  may  be 
viewed  as  a  form  of  Iterative,  Stochastic  Dynamic 
Programming  in  which  one  attempts  to  approximate 
value  iteration  (or  policy  iteration)  as  process 
characteristics  are  learned  through  experience.  The 
hallmark  of  reinforcement  learning  methods  is  their 
treatment  of  actual,  experienced  state  transitions  and 
rewards  as  unbiased  estimators  of  the  statistically  true 
values.  This  provides  for  incremental  calculations 
which  have  been  shown  to  theoretically  converge  in  the 
limit to optimal predictions. An important question
concerns the nature of the convergence of these
algorithms (e.g., convergence requirements and
convergence behavior as a function of some
appropriate parameter space).

Temporal Difference Learning [Sutton] has received
increased attention in recent years, as a type of
reinforcement learning which learns to predict
outcomes of Markov processes with delayed and
accrued rewards. The significance of this method is
that it can learn to address the difficult temporal credit
assignment problem in which one must assign credit
(or blame) to decisions when the feedback is noisy and
delayed. This is believed to describe many real-world
situations in which more conventional, supervised
learning methods are less efficient (or not viable)
because of the reliance on the availability of accurate
feedback at each decision point. In the absence of such
feedback, the task of calculating optimal actions at
each decision point becomes formidable, and a requirement
for farsighted, predictive considerations emerges.
Whereas conventional prediction-learning methods
assign credit (value of decisions) by means of the
difference between predicted and actual outcomes,
Temporal Difference Learning assigns credit by means
of the difference between temporally successive
predictions. This allows for more efficient learning.

Although Temporal Difference Learning has been used
in several famous machine intelligence systems
([Samuel], [Tesauro]), its convergence properties
remain poorly understood. In particular, proofs abound
which  show  that  Temporal  Difference  Learning 
converges  to  the  optimal  value  function,  but  only  given 
infinite  training  time.  Of  course,  in  practice  one  never 
has  infinite  training  time.  Therefore,  a  more  cogent 
question  for  the  practitioner  is  "what  is  the 
convergence  behavior  of  temporal  difference  learning 
when  training  times  are  finite?"  That  is  the  question  to 
which  this  paper  is  dedicated. 

We include several considerations in the term
"convergence behavior". These include performance
metrics, such as the rate of the probabilistic
convergence of predictions and the root mean-squared
error of predictions at deterministic stopping times,
considered against the algorithmic (synthesis) parameters:
the learning update rate and an important eligibility
horizon affecting the spread of credit or blame over the
sequence of states. The interactions of these factors
with system parameters, including differing amounts of
process noise and state transition and cost/reward
variances, will be briefly highlighted through
simulation results. The intent of this presentation is to
document general empirical results and to motivate
further (more rigorous) studies. Before this is
undertaken, however, a brief theoretical foundation
shall be provided.

THEORETICAL  FOUNDATIONS 

The  original  motivation  for  the  development  of 
temporal  difference  learning  methods  involved  the 
temporal  credit  assignment  problem.  There,  one 
experiences  a  series  of  state  transitions  which  terminate 
when  an  absorbing  state  is  reached.  There  is  a  cost,  z, 
associated  with  the  terminal  state;  all  other  states  (and 
transitions)  are  nominally  assumed  to  have  zero  cost. 
The goal of the Temporal Difference Learner is to
derive a sequence of predictions $\{P_1, P_2, \ldots, P_n\}$ of the
final outcome based upon a sequence of experienced
states, $\{x_1, x_2, \ldots\}$ (as shown in figure 1).


Figure  1  -  Conceptual  depiction  of  TDL 


The sequence of states is assumed to be a Markov
process in which state transition probabilities are
dependent only upon the most recent state:
$p(x_{t+1} \mid x_t, x_{t-1}, x_{t-2}, \ldots, x_1) = p(x_{t+1} \mid x_t)$. The learning
agent  is  parameterized  by  a  vector  of  weights,  w,  which 
control  the  sequence  of  outcome  predictions  based 
upon  state  trajectories.  The  goal  is  to  find  the 
appropriate  weight  settings  so  that  the  prediction  values 
are  optimized.  The  optimal  prediction  values  are 
known  to  satisfy  the  Bellman  Equation: 

$$P_w^*(x) = \min/\max \Big[ R(x) + \gamma \sum_{x'} p(x' \mid x)\, P_w^*(x') \Big] \qquad (1)$$

where $R(x)$ is the incremental transition cost to leave
state $x$, $x'$ is the state transitioned to, $P_w^*$ is the
weight-parameterized optimal prediction value, and $\gamma$ is
a discount factor in (0,1). The learning procedure is
expressed as a rule for updating $w$:

$$w(n+1) = w(n) + \sum_{t=1}^{m} \Delta w(t) \qquad (2)$$

In practice the change in weights, $\Delta w$, may be accrued
over a complete sequence, until a termination is
reached. A typical supervised learning method would
associate each state with the final outcome and perform
gradient-descent based upon each pair separately:

$$\Delta w(t) = \alpha (z - P_t) \nabla_w P_t \qquad (3)$$

where $\alpha$ is an update rate affecting the rate of
convergence and the gradient, $\nabla_w P_t$, is a vector of
partial derivatives of $P_t$ with respect to $w$. The
difficulty  with  this  approach  is  that  the  error  terms, 
(z-Pt),  can  only  be  computed  at  the  end  of  each 
sequence,  when  an  outcome  is  achieved.  This  is 
computationally  expensive.  The  temporal  difference 
method  is  based  upon  equating  the  final  prediction 
errors  to  a  string  of  intermediate  prediction  errors 
which  can  be  computed  incrementally: 
$$z - P_t = \sum_{k=t}^{m} (P_{k+1} - P_k), \quad \text{with } P_{m+1} \equiv z \qquad (4)$$


A  key  point  is  that  the  individual  predictions  serve  as 
unbiased  estimates  of  the  final  outcome,  z.  By 
substituting  equation  (4)  into  equations  (3)  and  (2)  we 
derive  a  new  weight  update  equation: 
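$$
\begin{aligned}
w &\leftarrow w + \sum_{t=1}^{m} \alpha (z - P_t) \nabla_w P_t \\
  &= w + \sum_{t=1}^{m} \alpha \sum_{k=t}^{m} (P_{k+1} - P_k) \nabla_w P_t \\
  &= w + \sum_{t=1}^{m} \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \nabla_w P_k
\end{aligned}
\qquad (5)
$$
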


where $m$ represents the number of state transitions
experienced in a particular Observation-Outcome
Sequence (OOS); this will in general be a random
variable. Note the last summation term in (5); this is an
eligibility term which assigns credit over a string of
past states. Finally, note that (5) can be computed
incrementally, after each state transition. Weight
updates may either be performed after the OOS has
completed (batch mode) or, alternatively, after each
individual state transition (on-line learning).
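
The equivalence between the supervised form (3) and the temporal
difference form in the last line of (5) can be checked numerically. The
sketch below is illustrative only: the function names, the example
predictions, and the example gradients are assumptions introduced here,
not values from the experiments.

```python
def supervised_update(preds, grads, z, alpha):
    """Sum of eq. (3) over one sequence: alpha * (z - P_t) * grad_t."""
    dim = len(grads[0])
    dw = [0.0] * dim
    for p_t, g_t in zip(preds, grads):
        for i in range(dim):
            dw[i] += alpha * (z - p_t) * g_t[i]
    return dw


def td_update(preds, grads, z, alpha):
    """Last line of eq. (5): alpha * (P_{t+1} - P_t) * sum_{k<=t} grad_k."""
    dim = len(grads[0])
    dw = [0.0] * dim
    elig = [0.0] * dim             # running sum of past gradients
    successors = preds[1:] + [z]   # P_{m+1} is taken to be the outcome z
    for p_t, p_next, g_t in zip(preds, successors, grads):
        for i in range(dim):
            elig[i] += g_t[i]
            dw[i] += alpha * (p_next - p_t) * elig[i]
    return dw


# Made-up predictions and gradients for a 4-step sequence (illustration only).
preds = [0.5, 0.4, 0.7, 0.9]
grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.0]]
a = supervised_update(preds, grads, z=1.0, alpha=0.1)
b = td_update(preds, grads, z=1.0, alpha=0.1)
```

Both functions return the same accumulated weight change (up to
floating-point rounding), but only `td_update` can be evaluated one
transition at a time.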

The TD(λ) family of learning procedures modifies
equation (5) by the inclusion of an exponential
weighting factor on the eligibility, λ ∈ [0,1]. This
results in a new weight update procedure in which the
incremental weight changes are given by

$$\Delta w(t) = \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{t-k} \nabla_w P_k \qquad (6)$$


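Equation (6) may be thought of as an error term for
temporal credit assignment coupled with a gradient-descent
mechanism for structural credit assignment.
The last summation term can be thought of as a state
eligibility factor, $e_t$, given by

$$e_{t+1} = \sum_{k=1}^{t+1} \lambda^{t+1-k} \nabla_w P_k = \nabla_w P_{t+1} + \lambda \sum_{k=1}^{t} \lambda^{t-k} \nabla_w P_k \qquad (7)$$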

Note  that  this  eligibility  factor  may  also  be  computed 
recursively. 
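
As a small illustration of the recursive computation, the following
sketch compares the explicit exponentially weighted sum with a running
recursion. It assumes scalar gradients for brevity, and the function
names are hypothetical.

```python
def eligibility_explicit(grads, lam):
    """e_t = sum_{k=1}^{t} lam**(t-k) * grad_k, for scalar gradients."""
    t = len(grads)
    return sum(lam ** (t - k) * grads[k - 1] for k in range(1, t + 1))


def eligibility_recursive(grads, lam):
    """e_t = grad_t + lam * e_{t-1}, starting from e_0 = 0."""
    e = 0.0
    for g in grads:
        e = g + lam * e
    return e


# Made-up scalar gradients, used only to compare the two forms.
example_grads = [1.0, 0.5, 2.0]
e_sum = eligibility_explicit(example_grads, lam=0.3)
e_rec = eligibility_recursive(example_grads, lam=0.3)
```

The recursive form needs only the previous eligibility and the current
gradient, so its cost per transition does not grow with the length of
the sequence.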

Using the equations given above, we can now derive
the TD(λ) learning algorithm:

1. Initialize parameters: x, w, λ, α

2. Observe a state transition

3. Compute new prediction, $P_t = f(x_t, w_t)$

4. Compute the new eligibility, $e_{t+1}$

5. Compute incremental weight update via (6)

6. If $x_t$ is not an absorbing state, go to step 2

The incremental weight updates computed in step 5
may either be immediately applied to the weights (in
the case of on-line learning) or they may be summed
and applied to the weights after the completion of an
OOS (in the case of batch mode learning).
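
The steps above can be sketched for a single observation-outcome
sequence as follows. This is a minimal sketch rather than the code used
for the experiments: it assumes a linear predictor, P_t = w · x_t, so
that the gradient of P_t with respect to w is simply x_t, and the name
`td_lambda_episode` and the one-hot state coding are hypothetical.

```python
def td_lambda_episode(w, states, z, alpha, lam, online=True):
    """Run TD(lambda) over one observation-outcome sequence.

    Assumes a linear predictor P_t = w . x_t, so grad_w P_t = x_t.
    Returns the updated weights; the input list is not modified.
    """
    w = list(w)
    dim = len(w)
    elig = [0.0] * dim      # eligibility, updated recursively
    pending = [0.0] * dim   # summed updates, used in batch mode

    def predict(x):
        return sum(wi * xi for wi, xi in zip(w, x))

    for t, x_t in enumerate(states):
        p_t = predict(x_t)
        # At the last step the successor prediction is the outcome z.
        p_next = z if t == len(states) - 1 else predict(states[t + 1])
        for i in range(dim):
            elig[i] = lam * elig[i] + x_t[i]
            dw_i = alpha * (p_next - p_t) * elig[i]   # eq. (6)
            if online:
                w[i] += dw_i
            else:
                pending[i] += dw_i
    if not online:
        w = [wi + di for wi, di in zip(w, pending)]
    return w


def one_hot(index, size):
    """Hypothetical state coding: one weight per state."""
    return [1.0 if j == index else 0.0 for j in range(size)]


w0 = [0.5, 0.5, 0.5]
episode = [one_hot(1, 3), one_hot(2, 3)]
w_lam1 = td_lambda_episode(w0, episode, z=1.0, alpha=0.1, lam=1.0, online=False)
w_lam0 = td_lambda_episode(w0, episode, z=1.0, alpha=0.1, lam=0.0, online=False)
```

In this two-step example the only non-zero temporal difference occurs at
the final transition. With λ=1 the correction is shared with the earlier
state through the eligibility, whereas with λ=0 only the most recent
state is updated.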

As mentioned in several recent papers ([Bertsekas],
[Jaakkola]), the convergence behavior of TD(λ)
remains poorly understood. One way to begin to study
the convergence of the predictions (or equivalently, the
weights) to the optimal prediction, P*, satisfying
equation (1), is to examine factors affecting the
convergence through repeated statistical experiments.
For a complete and rigorous analysis (a considerable
undertaking), one would need to evaluate the
performance of TD(λ) over the complete parameter
space, which includes*:

•  Synthesis Parameters: exponential eligibility
factor, λ; learning update rate, α; function
approximation method for $P_t = f(x_t, w_t)$; and the
internal state representation comprising $x_t$

•  System Parameters: characteristics of state
transition and cost/reward probability density
functions (mean, variance, time-dependency,
uniformity over the state space, etc.);
dimensionality/cardinality of the “true” state space;
nature of the Markov processes (topology, number
of absorbing states, etc.); initial state, etc.

There  are  several  alternatives  for  studying  the 
convergence  behavior  of  temporal  difference  learning. 
In  the  remainder  of  this  paper  we  shall  document  the 
results  of  high-level  experiments  which  produce 
several  interesting  insights  and  conjectures. 

EXPERIMENTAL  RESULTS 

In order to establish a top-level assessment of the
significance of various factors upon the convergence of
TD(λ), several experiments were conducted. These
involved learning to predict the outcomes of a Markov
process with 9 possible states, $x_1, x_2, \ldots, x_9$. The state
space may be considered as a string in which the end
states ($x_1$, $x_9$) represent two different outcomes
(possibly winning and losing a game). As indicated in
figure 2 below, the simulation nominally begins in state
5 and progresses according to the state transition
probabilities until a terminal state is reached.

* These lists are not exhaustive; the determination of the
proper parameter space to study convergence over is
worthy of extensive research in itself.


Figure 2 - 9-state, modified Random Walk


This Markov process is based upon the simpler random
walk process described and studied in [Sutton]. The
only important costs associated with the process are
those given by the transition cost function going into
the terminal states. The concept of “spread”, as
illustrated in figure 2, is related to state transition
variance.
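
A process of this kind can be simulated with a few lines of code. The
sketch below rests on assumptions that the text does not confirm: the
spread is interpreted as the number of equally likely next states
centered on the current state (including remaining in place), moves are
clipped at the end states, and the outcomes 0 and 1 are assigned to
states 1 and 9. The function name is hypothetical.

```python
import random


def simulate_walk(spread=3, start=5, n_states=9, rng=None):
    """Generate one sequence of the 9-state walk (assumed dynamics).

    Assumption: the next state is drawn uniformly from the `spread`
    states centered on the current one, clipped to [1, n_states].
    Returns the visited non-terminal states and the outcome z.
    """
    rng = rng or random.Random()
    half = spread // 2
    state = start
    visited = []
    while 1 < state < n_states:
        visited.append(state)
        step = rng.randint(-half, half)   # inclusive on both ends
        state = min(n_states, max(1, state + step))
    # Assumed outcome coding: 1 for the upper end state, 0 for the lower.
    outcome = 1.0 if state == n_states else 0.0
    return visited, outcome


visited, z = simulate_walk(spread=3, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes a run repeatable, which
is convenient when averaging results over several sets of simulations.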

Experiment  1 

The first set of experiments involved learning to
predict outcomes for the 9-state modified random walk
with a variety of settings for the synthesis parameters, λ
and α. Specifically, λ was set at λ=0, λ=0.3, λ=0.8, and
λ=1 while α was allowed to vary over α =
{0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6}. The metric used to judge
convergence was root-mean-squared error (RMSE)
taken after all learning was completed.
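
For reference, the RMSE metric can be computed as in the sketch below.
The text does not state which reference values the error is taken
against; this example assumes the ideal outcome prediction for each
state is available, and the numbers used are made up for illustration.

```python
import math


def rmse(predictions, targets):
    """Root-mean-squared error between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have equal length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical learned and ideal predictions for three states.
error = rmse([0.2, 0.5, 0.8], [0.25, 0.5, 0.75])
```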

In figure 3 below we can view the average RMSE
taken over 10 sets of simulations in which the
temporal difference learner operated over 2000 state
transitions. Note that the use of a fixed number of state
transitions, rather than a fixed number of sequences
(OOSs), allows for a more even comparison (since
OOSs will have varying lengths). It is particularly
interesting to note the erratic behavior of the TD(λ=1)
curve. This curve indicates that for the 9-state,
modified random walk example, TD(λ=1) is much
more sensitive to variations in α than are the other λ
curves.


Figure  3  -  Results  for  Experiment  1 


Experiment  2 

A more interesting experiment entails observing
changes in convergence behavior for cases where there
is more (or less) uncertainty in state transitions. In
particular we will examine cases where the spread (as
indicated in figure 2; spread is related to state transition
variance) is set to 3, 5, and 7. For each of these values
the probability density is evenly spread among the
possible transitions (as indicated in figure 2 for the
spread=3 case). In figure 4 we plot the RMSE values
averaged over all λ (λ = 0, 0.3, 0.8, 1).


It is interesting to note the improvement in going from
a spread of 3 to a spread of 5. The RMSE values do not
improve significantly when going to spread=7. This
seems to indicate that an increased variance in state
transition offers some benefit; the process becomes
more volatile, and singular state transitions contain
potentially more information. However, this benefit
wanes as more volatility is introduced.


Fig.  5a  -  RMSE,  weight  record  curves  for  spread=3 


Fig.  5b  -  RMSE,  weight  record  curves  for  spread=5 


Experiment  3 

A third experiment was run in which the learner was
allowed to partially converge for a given state
transition probability set. Then the transition
probabilities were changed and it was observed how
well each of the learners could recover for different
settings of α and λ. The result was somewhat surprising.
As shown in figure 6, learning with large λ values was
superior after the sudden change in transition
probabilities. This seems counter-intuitive, as one would
expect that updating weights over a shorter horizon would
be superior. This is an interesting matter for further study.


Experiment  4 

A  final  set  of  experiments  was  performed  in  which  the 
learner  was  allowed  to  start  in  different  states.  Not 
surprisingly,  convergence  was  sensitive  to  the  starting 
state. In figure 7 below, the weights for lower states
(physically, on the top of the graph), $x_1$-$x_3$, learn less
rapidly than those for the higher-numbered states. This
is because the process is started in state $x_7$ and, given
the  nature  of  the  state  transition  probabilities 
(uniformly  distributed,  see  fig.  2),  the  lower-numbered 
states  will  occur  less  frequently. 


Figure  7  -  Weight  changes  (exp.  4) 


As shown in figure 8, the RMSE values are drastically
different for this experiment than for some of the
earlier ones. This would indicate that there is also some
interaction between λ and the starting state.

Figure 8 - RMSE for experiment 4


[KLOPF] Klopf, A. H., “A Neuronal Model of
Classical Conditioning”, Psychobiology, vol. 16, pp. 85-125, 1988


CONCLUSION 

In  this  paper  we  have  briefly  discussed  some 
preliminary  considerations  for  what  may  be  a  fruitful 
new  thrust  in  the  study  of  reinforcement  learning 
algorithms.  We  have  argued  that  convergence 
properties  are  not  well  understood  and  that  they 
warrant  further  investigations.  Several  simple 
experiments  were  performed  and  documented  in  order 
to  get  a  first  look  at  the  behavior  of  these  important 
learning  algorithms. 